DATA STORAGE SYSTEM



RELATED APPLICATIONS 



This application is a continuation-in-part to 
application number 10/620080, titled "Data Allocation in a 
Distributed Storage System" and to application number
10/620249, titled "Distributed Independent Cache Memory," 
both filed 15 July, 2003, which are incorporated herein 
by reference. 



FIELD OF THE INVENTION

The present invention relates generally to data
storage, and specifically to data storage in distributed
data storage entities.

BACKGROUND OF THE INVENTION 

A distributed data storage system typically 
comprises cache memories that are coupled to a number of
disks wherein the data is permanently stored. The disks 
may be in the same general location, or be in completely 
different locations. Similarly, the caches may be 
localized or distributed. The storage system is normally 
used by one or more hosts external to the system.

Using more than one cache and more than one disk 
leads to a number of very practical advantages, such as 
protection against complete system failure if one of the 
caches or one of the disks malfunctions. Redundancy may 
be incorporated into a multiple cache or multiple disk
system, so that failure of a cache or a disk in the 
distributed storage system is not apparent to one of the 
external hosts, and has little effect on the functioning 
of the system. 

While distribution of the storage elements has
undoubted advantages, the fact of the distribution
typically leads to increased overhead compared to a local 
system having a single cache and a single disk. Inter 
alia, the increased overhead is required to manage the
increased number of system components, to equalize or
attempt to equalize usage of the components, to maintain
redundancy among the components, to operate a backup
system in the case of a failure of one of the components,
and to manage addition of components to, or removal of
components from, the system. A reduction in the required
overhead for a distributed storage system is desirable. 

An article titled "Consistent Hashing and Random
Trees: Distributed Caching Protocols for Relieving Hot
Spots on the World Wide Web," by Karger et al., in the
Proceedings of the 29th ACM Symposium on Theory of
Computing, pages 654-663, (May 1997), whose disclosure is
incorporated herein by reference, describes caching
protocols for relieving "hot spots" in distributed
networks. The article describes a hashing technique known
as consistent hashing, and the use of a consistent
hashing function. Such a function allocates objects to
devices so as to spread the objects evenly over the
devices, so that there is a minimal redistribution of
objects if there is a change in the devices, and so that
the allocation is consistent, i.e., is reproducible. The
article applies a consistent hashing function to read-only
cache systems, i.e., systems where a client may only
read data from the cache system, not write data to the
system, in order to distribute input/output requests to
the systems. A read-only cache system is used in much of
the World Wide Web, where a typical user is only able to
read from sites on the Web having such a system, not
write to such sites.
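
The three properties attributed above to a consistent hashing function
(even spread, minimal redistribution, reproducibility) can be illustrated
with a minimal sketch. It uses the commonly taught ring construction, which
is not necessarily the construction of the cited article. The class name
HashRing, the replicas parameter, and the use of SHA-256 are assumptions
made for illustration only.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a 64-bit ring position (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical ring-based consistent hashing of objects onto devices."""

    def __init__(self, devices, replicas=64):
        self._replicas = replicas
        self._ring = []  # sorted list of (ring position, device)
        for device in devices:
            self.add(device)

    def add(self, device):
        # Each device owns several positions so that load spreads evenly.
        self._ring.extend(
            (_point(f"{device}#{i}"), device) for i in range(self._replicas)
        )
        self._ring.sort()

    def remove(self, device):
        self._ring = [(pt, d) for pt, d in self._ring if d != device]

    def lookup(self, obj):
        # The object goes to the first device position clockwise from its hash.
        points = [pt for pt, _ in self._ring]
        i = bisect.bisect_right(points, _point(obj)) % len(self._ring)
        return self._ring[i][1]
```

In this sketch, adding a device can only move objects onto the added
device, and removing it again restores the earlier allocation, because the
ring positions of the other devices never change.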

An article titled "Differentiated Object Placement 
and Location for Self-Organizing Storage Clusters," by 
Tang et al., in Technical Report 2002-32 of the 
University of California, Santa Barbara (November 2002),
whose disclosure is incorporated herein by reference,
describes a protocol for managing a storage system where
components are added or removed from the system. The
protocol uses a consistent hashing scheme for placement
of small objects in the system. Large objects are placed
in the system according to a usage-based policy. 

An article titled "Compact, Adaptive Placement 
Schemes for Non-Uniform Capacities," by Brinkmann et al., 
in the August, 2002, Proceedings of the 14th ACM Symposium
on Parallel Algorithms and Architectures (SPAA), whose
disclosure is incorporated herein by reference, describes 
two strategies for distributing objects among a 
heterogeneous set of servers. Both strategies are based 
on hashing systems. 

U.S. patent 5,875,481 to Ashton, et al., whose
disclosure is incorporated herein by reference, describes
a method for dynamic reconfiguration of data storage
devices. The method assigns a selected number of the data
storage devices as input devices and a selected number of
the data storage devices as output devices in a
predetermined input/output ratio, so as to improve data 
transfer efficiency of the storage devices. 

U.S. patent 6,317,815 to Mayer, et al., whose
disclosure is incorporated herein by reference, describes
a method and apparatus for reformatting a main storage
device of a computer system. The main storage device is
reformatted by making use of a secondary storage device
on which is stored a copy of the data stored on the main
device.

U.S. patent 6,434,666 to Takahashi, et al., whose
disclosure is incorporated herein by reference, describes
a memory control apparatus. The apparatus is interposed
between a central processing unit (processor) and a
memory device that stores data. The apparatus has a
plurality of cache memories to temporarily store data
which is transferred between the processor and the memory
device, and a cache memory control unit which selects the
cache memory used to store the data being transferred.

U.S. patent 6,453,404 to Bereznyi, et al., whose
disclosure is incorporated herein by reference, describes
a cache system that allocates memory for storage of data
items by defining a series of small blocks that are
uniform in size. The cache system, rather than an
operating system, assigns one or more blocks for storage
of a data item. 

A number of different types of storage system are
known in the art. In a storage area network (SAN) data is
accessed in blocks at a device level, and the data is
transferred in blocks. Typically, the basic unit of data
organization is a logical unit (LU) which consists of a
sequence of logical block addresses (LBAs).

In a network attached storage (NAS) system, data is 
accessed as file data or file meta-data (parameters of
the file). The basic unit of organization is typically a
file. 

In an object storage architecture (OSA), the basic
unit of storage is a storage object, which comprises file
data together with meta-data. The latter comprise storage
attributes such as data layout and usage information.

Content addressed storage (CAS) is a particular case 
of OSA, designed for data that is intended to be stored 
and not changed. CAS assigns a unique identifier to the 
stored data, the identifier depending on the contents of 
the data.
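
As a minimal sketch of the content-addressing idea described above, the
identifier can be derived by hashing the stored bytes. The choice of
SHA-256 and the names content_id and ContentStore are assumptions made for
illustration and are not taken from any particular CAS product.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identifier that depends only on the contents (SHA-256 is assumed)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Hypothetical write-once store keyed by content identifier."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = content_id(data)
        # Storing identical contents twice yields the same key and one copy.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Because the key is a function of the contents, any change to the data would
produce a different identifier, which matches the intent that such data is
stored and not changed.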






SUMMARY OF THE INVENTION 

In embodiments of the present invention, groups of 
logical addresses are distributed among one or more 
storage devices comprised in a storage system. Each group 
of logical addresses is also herein termed a stripe. The
storage system receives data to be stored therein in
data-sets, and assigns each data-set a random value
chosen from a set of different numbers. In some
embodiments, each data-set comprises a file or other unit
of data created by a file system. The cardinality of the
set of different numbers is equal to the number of
stripes. The system delineates each data-set into equal-sized
partitions, and for each data-set the system
assigns each partition of the data-set a sequential
partition number.

The system allocates each partition to a specific
stripe in accordance with the sequential partition number
and the random value of the data-set of the partition, so
as to evenly distribute the partitions among the stripes.
Each partition is stored to the storage device
corresponding to the partition's allocated stripe. This
method of allocation ensures substantially even
distribution of the partitions among the stripes,
regardless of the size of the partitions, of the relative
sizes of the partitions and the stripes, and of
differences in sizes of the data-sets. The even
distribution applies irrespective of the type of data-set,
which may, for example, be a file or a data block.

In an embodiment of the present invention, the
stripes are sequentially numbered from 1 to s, where s is
the number of stripes in the storage system. A set R of
different numbers, from which the random value is chosen,
comprises all integral values from 0 to s-1. The storage
system assigns a random value r ∈ R to each specific
data-set that it receives for storage. Each partition,
numbered p, in the specific data-set is allocated for
storage in the storage system in the stripe whose number
is given by (r+p)modulo(s) if (r+p)modulo(s)≠0, and in
the stripe number s if (r+p)modulo(s)=0.
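
The allocation rule of this embodiment can be written out as a short
sketch. The function names choose_r, stripe_for, and allocate are
hypothetical, and the sketch assumes that the sequential partition numbers
p start at 1.

```python
import random


def choose_r(s: int, rng: random.Random) -> int:
    """Pick the per-data-set random value r from R = {0, ..., s-1}."""
    return rng.randrange(s)


def stripe_for(r: int, p: int, s: int) -> int:
    """Stripe number for partition p of a data-set with random value r.

    Stripes are numbered 1..s, so a remainder of 0 maps to stripe s.
    """
    m = (r + p) % s
    return m if m != 0 else s


def allocate(num_partitions: int, r: int, s: int) -> list:
    """Stripe numbers for partitions 1..num_partitions (numbering assumed)."""
    return [stripe_for(r, p, s) for p in range(1, num_partitions + 1)]
```

For example, with s = 4 stripes and r = 2, partitions 1 to 5 are allocated
to stripes 3, 4, 1, 2, and 3: consecutive partitions of a data-set visit
the stripes cyclically, starting from an offset set by r.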

If the storage system comprises more than one
storage device, the stripes may be distributed among the
storage devices by a procedure that provides a balanced
access to the devices. If a storage device is added to or
removed from the system, the procedure reallocates the
stripes among the new numbers of devices so that the
balanced access is maintained. If a device has been
added, the procedure only transfers stripes to the added
storage device. If a device has been removed, the
procedure only transfers stripes from the removed storage
device. In both cases, the only transfers of data that
occur are of partitions stored at the transferred
stripes. The procedure thus minimizes data transfer and
associated management overhead when the number of storage
devices is changed, or when the device configuration is
changed, while maintaining the balanced access.
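
The reallocation property described above can be sketched as follows. This
summary does not state how stripes are selected for transfer, so the random
selection and the round-robin placement used here are assumptions made for
illustration, as are the function names add_device and remove_device and
the dict that maps each stripe number to a device name.

```python
import random


def add_device(owner: dict, new_device: str, rng: random.Random) -> list:
    """Move stripes only TO the added device; returns the moved stripes."""
    devices = sorted(set(owner.values()))
    n_new = len(devices) + 1
    moved = []
    for device in devices:
        mine = sorted(st for st, d in owner.items() if d == device)
        # Each existing device gives up about 1/n_new of its stripes.
        for st in rng.sample(mine, len(mine) // n_new):
            owner[st] = new_device
            moved.append(st)
    return moved


def remove_device(owner: dict, surplus: str, rng: random.Random) -> list:
    """Move stripes only FROM the removed device; returns the moved stripes."""
    remaining = sorted(set(owner.values()) - {surplus})
    orphans = sorted(st for st, d in owner.items() if d == surplus)
    rng.shuffle(orphans)
    for i, st in enumerate(orphans):
        # Spread the orphaned stripes evenly over the remaining devices.
        owner[st] = remaining[i % len(remaining)]
    return orphans
```

In this sketch no stripe ever moves between two devices that are present
both before and after the change, so the only data transferred is that of
the partitions stored at the moved stripes.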

Typically, the storage devices comprise one or more
slow-access-time, mass-storage devices, and the storage
system comprises caches, herein also termed interim,
fast-access-time caches, coupled to the mass-storage
devices. Each cache is assigned a respective range of
stripes of the mass-storage devices. The storage system
typically comprises one or more interfaces, which receive
input/output (IO) requests from host processors directed
to specified data-sets and/or partitions of the data-sets.
The interfaces convert the IO requests to
converted-IO-requests directed to the stripes wherein the
data-sets and/or partitions are allocated, and direct all
the converted-IO-requests to the caches to which the
stripes are assigned.

Each interface translates the IO requests into the
converted-IO-requests by means of a mapping stored at the
device, the mapping for each interface being
substantially the same. Thus, adding or removing a cache
from the storage system simply requires updating of the 
mapping stored in each interface. 
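
A minimal sketch of the interface behavior described above follows. The
class names Interface and ConvertedRequest, the representation of the
mapping as (first stripe, last stripe, cache) ranges, and the assumption
that the interface already knows the random value r of the data-set are all
hypothetical and made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvertedRequest:
    cache: str   # cache to which the stripe is assigned
    stripe: int  # stripe holding the requested partition
    op: str      # e.g. "read" or "write"


class Interface:
    """Hypothetical interface holding a stripe-range-to-cache mapping."""

    def __init__(self, ranges):
        self._ranges = list(ranges)  # (first_stripe, last_stripe, cache)

    def update_mapping(self, ranges):
        # Adding or removing a cache only requires replacing the mapping.
        self._ranges = list(ranges)

    def cache_for(self, stripe: int) -> str:
        for first, last, cache in self._ranges:
            if first <= stripe <= last:
                return cache
        raise KeyError(stripe)

    def convert(self, op: str, r: int, p: int, s: int) -> ConvertedRequest:
        # Same stripe rule as in the summary: remainder 0 maps to stripe s.
        m = (r + p) % s
        stripe = m if m != 0 else s
        return ConvertedRequest(self.cache_for(stripe), stripe, op)
```

Because every interface holds substantially the same mapping, two
interfaces given the same request produce the same converted request, and
a change in the set of caches is absorbed by calling update_mapping on each
interface.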

The present invention discloses a data allocation 
approach that can be equally well used for storage area 
networks, network attached storage systems, or any other
kind of storage system. The approach is such that 
configuration changes can be easily handled with minimal 
internal data migration for reallocation purposes, while 
preserving a proper workload balance in the system. 

There is therefore provided, according to an
embodiment of the present invention, a method for storing
data, including:

distributing a first plurality of groups of logical
addresses among one or more storage devices;

receiving a second plurality of data-sets containing
the data to be stored;

assigning each data-set among the plurality of data-sets
a number chosen from a first plurality of different
numbers;

partitioning each data-set into multiple partitions,
so that each partition among the multiple partitions
receives a sequential partition number;

assigning each partition within each data-set to be
stored at a specific group of logical addresses in
accordance with the sequential partition number of the
partition and the random number assigned to the data-set;
and

storing each partition at the assigned specific
group of logical addresses.

The multiple partitions may include equal size
partitions.

The data-sets may include data from at least one of 
a file, file meta-data, a storage object, a data packet, 
a video tape, a music track, an image, a database record,
contents of a logical unit, and an email. 

In an embodiment, the first plurality of groups
consists of s groups each having a different integral
group number between 1 and s, the number consists of an
integer r chosen randomly from and including integers
between 0 and s-1, the sequential partition number
consists of a positive integer p, and the group number of
the assigned specific group is (r+p)modulo(s) if
(r+p)modulo(s)≠0, and s if (r+p)modulo(s)=0.

The method may be operative in at least one of a
storage area network, a network attached storage system,
and an object storage architecture.

The number may be chosen by a randomizing function, 
or alternatively by a consistent hashing function. 

There is further provided, according to an
embodiment of the present invention, a method for data
distribution, including:

receiving at least part of a data-set containing
data;

delineating the data into multiple partitions;

distributing logical addresses among an initial set
of storage devices so as to provide a balanced access to
the devices;

transferring the partitions to the storage devices
in accordance with the logical addresses;

adding an additional storage device to the initial
set, thus forming an extended set of the storage devices
comprising the initial set and the additional storage
device; and




redistributing the logical addresses among the 
storage devices in the extended set so as to cause a 
portion of the logical addresses and the partitions 
stored thereat to be transferred from the storage devices 
in the initial set to the additional storage device,
while maintaining the balanced access and without 
requiring a substantial transfer of the logical addresses 
among the storage devices in the initial set. 

The data-set may include data from at least one of a 
file, file meta-data, a storage object, a data packet, a
video tape, a music track, an image, a database record, 
contents of a logical unit, and an email. 

The initial set of storage devices and the 
additional storage device may be operative in at least 
one of a storage area network, a network attached storage
system, and an object storage architecture. 

Distributing the logical addresses may include:

generating a first plurality of sets of logical
addresses,

and delineating the data may include:

assigning the at least part of the data-set a number
chosen from a first plurality of different numbers; and

assigning each partition among the multiple
partitions a sequential partition number,

and transferring the partitions may include:

storing each partition at one of the sets of logical
addresses in accordance with the sequential partition
number of the partition and the number.

There is further provided, according to an 
embodiment of the present invention, a method for data
distribution, including: 

receiving at least part of a data-set containing
data;

delineating the data into multiple partitions;

distributing logical addresses among an initial set
of storage devices so as to provide a balanced access to
the devices;

transferring the partitions to the storage devices
in accordance with the logical addresses;

removing a surplus storage device from the initial 
set, thus forming a depleted set of the storage devices 
comprising the initial set less the surplus storage 
device; and 

redistributing the logical addresses among the
storage devices in the depleted set so as to cause the
logical addresses of the surplus device and the
partitions stored thereat to be transferred to the
depleted set, while maintaining the balanced access and
without requiring a substantial transfer of the logical
addresses among the storage devices in the depleted set.

The data-set may include data from at least one of a 
file, file meta-data, a storage object, a data packet, a 
video tape, a music track, an image, a database record,
contents of a logical unit, and an email.

The initial set of storage devices may be operative 
in at least one of a storage area network, a network 
attached storage system, and an object storage 
architecture.

Distributing the logical addresses may include:

generating a first plurality of sets of logical
addresses,

and delineating the data may include:

assigning the at least part of the data-set a number
chosen from a first plurality of different numbers; and

assigning each partition among the multiple
partitions a sequential partition number,

and transferring the partitions may include:

storing each partition at one of the sets of logical
addresses in accordance with the sequential partition
number of the partition and the number.

There is further provided, according to an 
embodiment of the present invention, a data storage 
system, including:

one or more mass-storage devices, coupled to store 
partitions of data at respective first ranges of logical 
addresses (LAs);

a plurality of interim devices, configured to 
operate independently of one another, each interim device
being assigned a respective second range of the LAs and
coupled to receive partitions of data from and provide
partitions of data to the one or more mass-storage
devices having LAs within the respective second range;
and

one or more interfaces, which are adapted to receive
input/output (IO) requests from host processors, to
identify specified partitions of data in response to the
IO requests, to convert the IO requests to
converted-IO-requests directed to specified LAs in response to the
specified partitions of data, and to direct all the
converted-IO-requests to the interim device to which the
specified LAs are assigned.

At least one of the mass-storage devices may have a 
slow access time, and at least one of the interim devices
may have a fast access time. 

The one or more mass-storage devices may be coupled 
to provide a balanced access to the first ranges of LAs. 

The storage system may operate in at least one of a 
storage area network, a network attached storage system,
and an object storage architecture. 

There is further provided, according to an 
embodiment of the present invention, a data storage 
system, including:

one or more storage devices wherein are distributed
a first plurality of groups of logical addresses; and

a processing unit which is adapted to: 

receive a second plurality of data-sets containing 
the data to be stored,

assign each data-set among the plurality of data- 
sets a number chosen from a first plurality of different 
numbers, 

partition each data-set into multiple partitions, so 
that each partition among the multiple partitions
receives a sequential partition number, 

assign each partition within each data-set to be 
stored at a specific group of logical addresses in the 
one or more storage devices in accordance with the 
sequential partition number of the partition and the
number assigned to the data-set, and 

store each partition in the one or more storage 
devices at the assigned specific group of logical 
addresses.

The multiple partitions may include equal size
partitions.

The data-sets may include data from at least one of 
a file, file meta-data, a storage object, a data packet, 
a video tape, a music track, an image, a database record, 
contents of a logical unit, and an email.

The first plurality of groups may include s groups 
each having a different integral group number between 1 
and s, the number may include an integer r chosen 
randomly from and including integers between 0 and s-1, 
the sequential partition number may include a positive
integer p, and the group number of the assigned specific
group may be (r+p)modulo(s) if (r+p)modulo(s)≠0, and s if
(r+p)modulo(s)=0.

The one or more storage devices and the processing
unit may operate in at least one of a storage area
network, a network attached storage system, and an object
storage architecture.

There is further provided, according to an 
embodiment of the present invention, data distribution
apparatus, including:

an initial set of storage devices among which are 
distributed logical addresses so as to provide a balanced 
access to the devices; 

an additional storage device to the initial set,
thus forming an extended set of the storage devices
consisting of the initial set and the additional storage 
device; and 

a processor which is adapted to receive at least
part of a data-set containing data, to delineate the data
into multiple partitions, to transfer the partitions to
the initial set of storage devices in accordance with the
logical addresses, to redistribute the logical addresses
among the storage devices in the extended set so as to
cause a portion of the logical addresses and the
partitions stored thereat to be transferred from the
storage devices in the initial set to the additional
storage device, while maintaining the balanced access and
without requiring a substantial transfer of the logical
addresses among the storage devices in the initial set.

The data-set may include data from at least one of a 
file, file meta-data, a storage object, a data packet, a 
video tape, a music track, an image, a database record, 
contents of a logical unit, and an email. 

The initial set of storage devices and the
additional storage device may operate in at least one of
a storage area network, a network attached storage 
system, and an object storage architecture. 

The logical addresses may include a plurality of
sets of logical addresses, and the processor may be
adapted to:

assign the at least part of the data-set a number 
chosen from a plurality of different numbers, 

assign each partition among the multiple partitions
a sequential partition number, and

store each partition at one of the sets of logical 
addresses in accordance with the sequential partition 
number of the partition and the number. 

There is further provided, according to an
embodiment of the present invention, data distribution
apparatus, including:

an initial set of storage devices among which are 
distributed logical addresses so as to provide a balanced 
access to the devices;

a depleted set of storage devices, formed by 
subtracting a surplus storage device from the initial 
set; and 

a processor which is adapted to receive at least 
part of a data-set containing data, to delineate the data
into multiple partitions, to transfer the partitions to
the initial set of storage devices in accordance with the
logical addresses, to redistribute the logical addresses
and the partitions stored thereat of the surplus storage
device among the storage devices in the depleted set
while maintaining the balanced access and without 
requiring a substantial transfer of the logical addresses 
among the storage devices in the depleted set. 

The data-set may include data from at least one of a 
file, file meta-data, a storage object, a data packet, a
video tape, a music track, an image, a database record, 
contents of a logical unit, and an email. 

The initial set of storage devices may be operative
in at least one of a storage area network, a network
attached storage system, and an object storage
architecture.

The logical addresses may include a plurality of 
sets of logical addresses, and the processor may be 
adapted to:

assign the at least part of the data-set a number 
chosen from a plurality of different numbers, 

assign each partition among the multiple partitions 
a sequential partition number, and 

store each partition at one of the sets of logical
addresses in accordance with the sequential partition
number of the partition and the number. 

There is further provided, according to an 
embodiment of the present invention, a method for storing 
data, including:

coupling one or more mass-storage devices to store 
partitions of data at respective first ranges of logical 
addresses (LAs);

configuring a plurality of interim devices to 
operate independently of one another;

assigning each interim device a respective second 
range of the LAs; 

coupling each interim device to receive the 
partitions of data from and provide the partitions of 
data to the one or more mass-storage devices having LAs
within the respective second range; 

receiving input/output (IO) requests from host
processors;

identifying specified partitions of data in response
to the IO requests;

converting the IO requests to converted-IO-requests
directed to specified LAs in response to the specified
partitions of data; and

directing all the converted-IO-requests to the
interim device to which the specified LAs are assigned.

At least one of the mass-storage devices may have a 
slow access time, and at least one of the interim devices 
may have a fast access time. 

The one or more mass-storage devices may be coupled
to provide a balanced access to the first ranges of LAs.

The one or more storage devices and the plurality of 
interim devices may operate in at least one of a storage 
area network, a network attached storage system, and an 
object storage architecture.

There is further provided, according to an 
embodiment of the present invention, a method for data 
distribution, including:

receiving at least part of a data-set containing
data;

delineating the data into multiple equal size
partitions;

transferring the partitions to an initial set of
storage devices so as to provide a balanced access to the
devices;

adding an additional storage device to the initial 
set, thus forming an extended set of the storage devices 
comprising the initial set and the additional storage 
device; and 

redistributing the partitions among the storage
devices in the extended set so as to cause a portion of
the partitions to be transferred from the storage devices
in the initial set to the additional storage device,
while maintaining the balanced access and without
requiring a substantial transfer of the partitions among
the storage devices in the initial set. 

There is further provided, according to an 
embodiment of the present invention, a method for data 
distribution, including:

receiving at least part of a data-set containing
data;

delineating the data into multiple equal size
partitions;

transferring the partitions to an initial set of
storage devices so as to provide a balanced access to the
devices;

removing a surplus storage device from the initial 
set, thus forming a depleted set of the storage devices 
comprising the initial set less the surplus storage
device; and 

redistributing the partitions stored in the surplus 
device to the depleted set, while maintaining the 
balanced access and without requiring a substantial 
transfer of the partitions among the storage devices in
the depleted set. 

There is further provided, according to an 
embodiment of the present invention, data distribution 
apparatus, including: 

an initial set of storage devices;

an additional storage device to the initial set, 
thus forming an extended set of the storage devices 
comprising the initial set and the additional storage 
device; and 

a processor which is adapted to receive at least
part of a data-set containing data, to delineate the data
into multiple equal size partitions, to transfer the
partitions to the initial set of storage devices so as to
provide a balanced access to the initial set of storage
devices, to redistribute the partitions among the storage
devices in the extended set so as to cause a portion of the
partitions stored in the initial set to be transferred to
the additional storage device, while maintaining the
balanced access and without requiring a substantial
transfer of the partitions among the storage devices in
the initial set.

There is further provided, according to an 
embodiment of the present invention, data distribution 
apparatus, including:

an initial set of storage devices; 

a depleted set of storage devices, formed by 
subtracting a surplus storage device from the initial 
set; and 

a processor which is adapted to receive at least 
part of a data-set containing data, to delineate the data 
into multiple equal size partitions, to transfer the 
partitions to the initial set of storage devices so as to 
provide a balanced access to the initial set of storage 
devices, to redistribute the partitions of the surplus 
storage device among the storage devices in the depleted 
set while maintaining the balanced access and without 
requiring a substantial transfer of the partitions among 
the storage devices in the depleted set. 

The present invention will be more fully understood 
from the following detailed description of the preferred 
embodiments thereof, taken together with the drawings, a 
brief description of which is given below. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates distribution of data addresses 
among data storage devices, according to an embodiment of 
the present invention; 

Fig. 2 is a flowchart describing a procedure for
allocating addresses to the devices of Fig. 1, according
to an embodiment of the present invention; 

Fig. 3 is a flowchart describing an alternative 
procedure for allocating addresses to the devices of Fig. 
1, according to an embodiment of the present invention;

Fig. 4 is a schematic diagram illustrating 
reallocation of addresses when a storage device is 
removed from the devices of Fig. 1, according to an 
embodiment of the present invention; 

Fig. 5 is a schematic diagram illustrating
reallocation of addresses when a storage device is added
to the devices of Fig. 1, according to an embodiment of 
the present invention; 

Fig. 6 is a flowchart describing a procedure that is 
a modification of the procedure of Fig. 2, according to
an embodiment of the present invention; 

Fig. 7 is a schematic diagram which illustrates a 
fully mirrored distribution of data for the devices of 
Fig. 1, according to an embodiment of the present 
invention;

Fig. 8 is a flowchart describing a procedure for 
performing the distribution of Fig. 7, according to an 
embodiment of the present invention;

Fig. 9 is a schematic diagram of a storage system, 
according to an embodiment of the present invention;

Fig. 10 is a schematic diagram illustrating 
distribution of data in one or more storage devices of 
the system of Fig. 9; 

Fig. 11 is a schematic diagram illustrating an
alternative method of distribution of data D in the
system of Fig. 9, according to an embodiment of the 
present invention; 

Fig. 12 is a flowchart showing steps performed when 
data stored in devices of the system of Fig. 9 is
redistributed if a device is added to or removed from the 
system, according to an embodiment of the present 
invention; 

Fig. 13 is a flowchart showing steps performed when 
data stored in devices of the system of Fig. 9 is
redistributed if a device is added to or removed from the 
system, according to an alternative embodiment of the 
present invention; 

Fig. 14 is a schematic block diagram of an 
alternative storage system, according to an embodiment of
the present invention; and 

Fig. 15 is a flow chart showing steps followed by 
the system of Fig. 14 on receipt of an input/output 
request, according to an embodiment of the present 
invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Reference is now made to Fig. 1, which illustrates
distribution of data addresses among data storage
devices, according to an embodiment of the present
invention. A storage system 12 comprises a plurality of
separate storage devices 14, 16, 18, 20, and 22, also
respectively referred to herein as storage devices B1,
B2, B3, B4, and B5, and collectively as devices Bn. It
will be understood that system 12 may comprise
substantially any number of physically separate devices,
and that the five devices Bn used herein are by way of
example. Devices Bn comprise any components wherein data
33, also herein termed data D, may be stored, processed,
and/or serviced. Examples of devices Bn comprise random
access memories (RAMs), which have a fast access time and
which are typically used as caches, disks, which typically
have a slow access time, or any combination of such
components. A host 24 communicates with system 12 in
order to read data from, or write data to, the system. A
processor 26 uses a memory 28 to manage system 12 and
allocate data D to devices Bn. It will be appreciated
that processor 26 may comprise one or more processing
units, and that some or all of the processing units may
be centralized or distributed in substantially any
suitable locations, such as within devices Bn and/or host
24. The allocation of data D by processor 26 to devices
Bn is described in more detail below.

Data D is processed in devices Bn at logical
addresses (LAs) of the devices by being written to the
devices from host 24 and/or read from the devices by host
24. At initialization of system 12, processor 26
distributes the LAs of devices Bn among the devices using
one of the pre-defined procedures described below.
Processor 26 may then store data D at the LAs.

In the description of the procedures hereinbelow,
devices Bn are assumed to have substantially equal
capacities, where the capacity of a specific device is a
function of the device type. For example, for devices
that comprise mass data storage devices having slow
access times, such as disks, the capacity is typically
defined in terms of the quantity of data the device may
store. For devices that comprise fast access time
memories, such as are used in caches, the capacity is
typically defined in terms of the quantity of data the
device can store, the throughput rate of the device, or
both parameters. Those skilled in the art will be able to
adapt the procedures when devices Bn have different
capacities, in which case ratios of the capacities are
typically used to determine the allocations. The
procedures allocate groups of one or more LAs to devices
Bn so that balanced access to the devices is maintained,
where balanced access assumes that, taken over
approximately 10,000×N transactions with devices Bn, the
fractions of the capacities of devices Bn used are equal
to within approximately 1%, where N is the number of
devices Bn, the values being based on a Bernoulli
distribution.
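By way of illustration only, one reading of the 1% figure is the following calculation. It assumes, as a simplification not stated above, that each of the T = 10,000×N transactions is directed to any given device independently with probability 1/N, so that the count X of transactions at one device is a sum of Bernoulli trials:

```latex
\begin{aligned}
T &= 10{,}000\,N, \qquad E[X] = \frac{T}{N} = 10{,}000,\\
\sigma_X &= \sqrt{T \cdot \frac{1}{N}\left(1 - \frac{1}{N}\right)}
          = 100\sqrt{1 - \frac{1}{N}},\\
\frac{\sigma_X}{E[X]} &= \frac{\sqrt{1 - 1/N}}{100} < 1\%.
\end{aligned}
```

Under this assumption the relative spread of one standard deviation is about 0.89% for N = 5, which is of the order of the 1% tolerance quoted above.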

Fig. 2 is a flowchart describing a procedure 50 for
allocating LAs to devices Bn, according to an embodiment
of the present invention. The LAs are assumed to be
grouped into k logical stripes/tracks, hereinbelow termed
stripes 36 (Fig. 1), which are numbered 1, ..., k, where k
is a whole number. Each logical stripe comprises one or
more consecutive LAs, and all the stripes have the same
length. Procedure 50 uses a randomizing function to
allocate a stripe s to devices Bn in system 12. The
allocations determined by procedure 50 are stored in a
table 32 of memory 28.

In an initial step 52, processor 26 determines an
initial value of s, the total number Td of active devices
Bn in system 12, and assigns each device Bn a unique
integral identity between 1 and Td. In a second step 54,
the processor generates a random integer R between 1 and
Td, and allocates stripe s to the device Bn corresponding
to R. In a third step 56, the allocation determined in
step 54 is stored in table 32. Procedure 50 continues, in
a step 58, by incrementing the value of s, until all
stripes of devices Bn have been allocated, i.e., until
s > k, at which point procedure 50 terminates.
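By way of illustration only, steps 52 to 58 may be sketched as follows. The sketch is not the implementation of the embodiment: the function name allocate_stripes, the use of a Python dictionary in place of table 32, and the seeded pseudo-random generator are all assumptions made for the example.

```python
import random

def allocate_stripes(num_stripes, num_devices, seed=None):
    """Sketch of procedure 50: map each stripe 1..k to a random device 1..Td."""
    rng = random.Random(seed)   # seeded only so that the sketch is repeatable
    table = {}                  # stands in for table 32 (hypothetical layout)
    s = 1                       # step 52: initial value of s
    while s <= num_stripes:     # the procedure terminates once s > k
        r = rng.randint(1, num_devices)  # step 54: random integer R in 1..Td
        table[s] = r                     # step 56: store the allocation
        s += 1                           # step 58: increment s
    return table

table_32 = allocate_stripes(num_stripes=10_000, num_devices=5, seed=1)
counts = [sum(1 for d in table_32.values() if d == n) for n in range(1, 6)]
print(counts)  # number of stripes held by B1..B5
```

With many more stripes than devices, the per-device counts printed above are expected to be close to one another, which is the balanced access the procedure aims for.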

Table I below is an example of an allocation table
generated by procedure 50, for system 12, wherein Td = 5.
The identifying integers for each device Bn, as
determined by processor 26 in step 52, are assumed to be
1 for B1, 2 for B2, ..., 5 for B5.

Table I



Fig. 3 is a flowchart showing steps of a procedure
70 using a consistent hashing function to allocate
stripes to devices Bn, according to an alternative
embodiment of the present invention. In an initial step
72, processor 26 determines a maximum number N of devices
Bn for system 12, and a number of points k for each
device. The processor then determines an integer M, such
that M >> N·k.

In a second step 74, processor 26 determines N sets
J_n of k random values S_ab, each set corresponding to a
possible device Bn, as given by equations (1):

$$
\begin{aligned}
J_1 &= \{S_{11}, S_{12}, \ldots, S_{1k}\} \text{ for device } B_1;\\
J_2 &= \{S_{21}, S_{22}, \ldots, S_{2k}\} \text{ for device } B_2;\\
&\;\;\vdots\\
J_N &= \{S_{N1}, S_{N2}, \ldots, S_{Nk}\} \text{ for device } B_N.
\end{aligned}
\tag{1}
$$

Each random value S_ab is chosen from {0, 1, 2, ..., M-1},
and the value of each S_ab may not repeat, i.e., each
value may only appear once in all the sets. The sets of
random values are stored in memory 28.
In a third step 76, for each stripe s processor 26
determines a value of s mod(M) and then a value of
F(s mod(M)), where F is a permutation function that
reassigns the value of s mod(M) so that in a final step 78
consecutive stripes will generally be mapped to different
devices Bn.

In final step 78, the processor finds, typically
using an iterative search process, the random value
chosen in step 74 that is closest to F(s mod(M)). Processor
26 then assigns the device Bn of the random value to
stripe s, according to equations (1).
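By way of illustration only, steps 72 to 78 may be sketched as follows. Several choices in the sketch are assumptions that the description leaves open: "closest" is taken to mean smallest absolute difference, the permutation function F is taken to be multiplication by a constant that is coprime to M, and the names make_points, permute and device_for_stripe are hypothetical.

```python
import random

def make_points(max_devices, points_per_device, m, seed=None):
    """Step 74: draw N sets of k values from {0, ..., M-1} with no repeats."""
    rng = random.Random(seed)
    k = points_per_device
    values = rng.sample(range(m), max_devices * k)   # sample() never repeats
    return {n + 1: values[n * k:(n + 1) * k] for n in range(max_devices)}

def permute(x, m, multiplier=40503):
    """Assumed F: multiplying by a constant coprime to M permutes 0..M-1."""
    return (x * multiplier) % m

def device_for_stripe(s, points, m, active):
    """Steps 76-78: the active device owning the value closest to F(s mod M)."""
    target = permute(s % m, m)
    # ties in distance are broken by the lower device identity (an assumption)
    return min((abs(v - target), n) for n in active for v in points[n])[1]

M = 2 ** 20                       # M >> N*k; the odd multiplier is coprime to it
points = make_points(max_devices=8, points_per_device=50, m=M, seed=7)
active = [1, 2, 3, 4, 5]
allocation = {s: device_for_stripe(s, points, M, active) for s in range(1, 301)}
depleted = {s: device_for_stripe(s, points, M, [1, 2, 4, 5]) for s in range(1, 301)}
```

In this sketch a stripe changes device between allocation and depleted only if it was held by device 3, because removing the points of one device cannot change which of the remaining points is closest to a target owned by another device.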

It will be appreciated that procedure 70 illustrates
one type of consistent hashing function, and that other
such functions may be used by system 12 to allocate LAs
to devices operating in the system. All such consistent
hashing functions are assumed to be comprised within the
scope of the present invention.

Procedure 70 may be incorporated into memory 28 of
system 12 (Fig. 1), and the procedure operated by
processor 26 when allocation of stripes s is required,
such as when data is to be read from or written to system
12. Alternatively, a table 30 of the results of applying
procedure 70, generally similar to the first and last
columns of Table I, may be stored in memory 28, and
accessed by processor 26 as required.

Fig. 4 is a schematic diagram illustrating
reallocation of stripes when a storage device is removed
from storage system 12, according to an embodiment of the
present invention. By way of example, device B3 is
assumed to be no longer active in system 12 at a time
t=1, after initialization time t=0, and the stripes
initially allocated to the device, and any data stored
therein, are reallocated to the depleted set of devices
B1, B2, B4, B5 of the system. Device B3 may be no longer
active for a number of reasons known in the art, such as
device failure, or the device becoming surplus to the
system, and such a device is herein termed a surplus
device. The reallocation is performed using procedure 50
or procedure 70, typically according to the procedure
that was used at time t=0. As is illustrated in Fig. 4,
and as is described below, stripes from device B3 are
substantially evenly redistributed among devices B1, B2,
B4, B5.

If procedure 50 (Fig. 2) is applied at t=1, the
procedure is applied to the stripes of device B3, so as
to randomly assign the stripes to the remaining active
devices of system 12. In this case, at step 52 the total
number of active devices Td = 4, and identifying integers
for each active device Bn are assumed to be 1 for B1, 2
for B2, 4 for B4, 3 for B5. Processor 26 generates a new
table, corresponding to the first and last columns of
Table II below for the stripes that were allocated to B3
at t=0, and the stripes are reassigned according to the
new table. Table II illustrates reallocation of stripes
for device B3 (from the allocation shown in Table I).

Table II



It will be appreciated that procedure 50 only
generates transfer of stripes from the device that is no
longer active in system 12, and that the procedure
reallocates the stripes, and any data stored therein,
substantially evenly over the remaining active devices of
the system. No reallocation of stripes occurs in system
12 other than stripes that were initially allocated to
the device that is no longer active. Similarly, no
transfer of data occurs other than data that was
initially in the device that is no longer active. Also,
any such transfer of data may be performed by processor
26 transferring the data directly from the inactive
device to the reallocated device, with no intermediate
device needing to be used.
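By way of illustration only, applying procedure 50 to the stripes of a removed device may be sketched as follows. The function name reallocate_removed, the dictionary standing in for the allocation table, and the randomly generated starting table (a stand-in for Table I, not its actual contents) are assumptions made for the example; the identity mapping follows the one given above for t=1.

```python
import random

def reallocate_removed(table, removed, identities, seed=None):
    """Re-run the random draw of procedure 50 only for the removed device's stripes."""
    rng = random.Random(seed)       # seeded only so that the sketch is repeatable
    total = len(identities)         # step 52: Td for the depleted set
    new_table = dict(table)
    for stripe, device in table.items():
        if device == removed:
            r = rng.randint(1, total)           # step 54: R in 1..Td
            new_table[stripe] = identities[r]   # step 56: record the new device
    return new_table

rng0 = random.Random(0)
table_t0 = {s: rng0.randint(1, 5) for s in range(1, 1001)}  # stand-in allocation at t=0
identities = {1: 1, 2: 2, 3: 5, 4: 4}   # 1 for B1, 2 for B2, 3 for B5, 4 for B4
table_t1 = reallocate_removed(table_t0, removed=3, identities=identities, seed=1)
moved = [s for s in table_t0 if table_t0[s] != table_t1[s]]
print(len(moved), "stripes moved")
```

Every stripe in moved was held by B3 at t=0, and no other entry of the table changes, which corresponds to the property described above.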

Similarly, by consideration of procedure 70 (Fig.
3), it will be appreciated that procedure 70 only
generates transfer of stripes, and reallocation of data
stored therein, from the device that is no longer active
in system 12, i.e., device B3. Procedure 70 reallocates
the stripes (and thus their data) from B3 substantially
evenly over the remaining devices B1, B2, B4, B5 of the
system; no reallocation of stripes or data occurs in
system 12 other than stripes/data that were initially in
B3; and such data transfer as may be necessary may be
performed by direct transfer to the remaining active
devices. It will also be understood that if B3 is
returned to system 12 at some future time, the allocation
of stripes after procedure 70 is implemented is the same
as the initial allocation generated by the procedure.

Fig. 5 is a schematic diagram illustrating
reallocation of stripes when a storage device is added to
storage system 12, according to an embodiment of the
present invention. By way of example, a device 23, also
herein termed device B6, is assumed to be active in
system 12 at time t=2, after initialization time t=0, and
some of the stripes initially allocated to an initial set
of devices B1, B2, B3, B4, B5, and any data stored
therein, are reallocated to device B6. The reallocation
is performed using procedure 70 or a modification of
procedure 50 (described in more detail below with
reference to Fig. 6), typically according to the
procedure that was used at time t=0. As is illustrated in
Fig. 5, and as is described below, stripes from devices
B1, B2, B3, B4, B5 are substantially evenly removed from
the devices and are transferred to device B6. B1, B2, B3,
B4, B5, B6 act as an extended set of the initial set.

Fig. 6 is a flowchart describing a procedure 90 that
is a modification of procedure 50 (Fig. 2), according to
an alternative embodiment of the present invention. Apart
from the differences described below, procedure 90 is
generally similar to procedure 50, so that steps
indicated by the same reference numerals in both
procedures are generally identical in implementation. As
in procedure 50, procedure 90 uses a randomizing function
to allocate stripes s to devices Bn in system 12, when a
device is added to the system. The allocations determined
by procedure 90 are stored in table 32 of memory 28.

Assuming procedure 90 is applied at t=2, at step 52
the total number of active devices Td = 6, and
identifying integers for each active device Bn are
assumed to be 1 for B1, 2 for B2, 3 for B3, 4 for B4, 5
for B5, 6 for B6. In a step 91 processor 26 determines a
random integer between 1 and 6.

In a step 92, the processor determines if the random
number corresponds to one of the devices present at time
t=0. If it does correspond, then processor 26 returns to
the beginning of procedure 90 by incrementing stripe s,
via step 58, and no reallocation of stripe s is made. If
it does not correspond, i.e., the random number is 6,
corresponding to device B6, the stripe is reallocated to
device B6. In step 56, the reallocated location is stored
in table 32. Procedure 90 then continues to step 58.
Table III below illustrates the results of applying
procedure 90 to the allocation of stripes given in Table
II.

Table III



It will be appreciated that procedure 90 only
generates transfer of stripes, and thus reallocation of
data, to device B6. The procedure reallocates the stripes
to B6 by transferring stripes, substantially evenly, from
devices B1, B2, B3, B4, B5 of the system, and no transfer
of stripes, or data stored therein, occurs in system 12
other than stripes/data transferred to B6. Any such data
transfer may be made directly to device B6, without use
of an intermediate device Bn.
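By way of illustration only, steps 91, 92, 56 and 58 of procedure 90 may be sketched as follows. The function name add_device, the dictionary standing in for table 32, and the randomly generated starting table are assumptions made for the example.

```python
import random

def add_device(table, new_device, total_devices, seed=None):
    """Sketch of procedure 90: move a stripe only when the draw hits the new device."""
    rng = random.Random(seed)       # seeded only so that the sketch is repeatable
    new_table = dict(table)
    for stripe in sorted(table):
        r = rng.randint(1, total_devices)    # step 91: random integer in 1..Td
        if r == new_device:                  # step 92: not a device present at t=0
            new_table[stripe] = new_device   # step 56: record the reallocation
        # otherwise the stripe keeps its device and step 58 moves to the next stripe
    return new_table

rng0 = random.Random(0)
table_before = {s: rng0.randint(1, 5) for s in range(1, 1001)}  # stand-in allocation
table_after = add_device(table_before, new_device=6, total_devices=6, seed=2)
moved = [s for s in table_before if table_before[s] != table_after[s]]
print(len(moved), "stripes moved to the added device")
```

Each stripe is moved with probability 1/Td regardless of which device currently holds it, so the stripes taken by the added device are expected to come roughly evenly from the existing devices.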

It will also be appreciated that procedure 70 may be
applied when device B6 is added to system 12.
Consideration of procedure 70 shows that similar results
to those of procedure 90 apply, i.e., that there is only
reallocation of stripes, and data stored therein, to
device B6. As for procedure 90, procedure 70 generates
substantially even reallocation of stripes/data from the
other devices of the system.

Fig. 7 is a schematic diagram which illustrates a
fully mirrored distribution of data D in storage system
12 (Fig. 1), and Fig. 8 is a flowchart illustrating a
procedure 100 for performing the distribution, according
to embodiments of the present invention. Procedure 100
allocates each specific stripe to a primary device Bn1,
and a copy of the specific stripe to a secondary device
Bn2, n1≠n2, so that each stripe is mirrored. To implement
the mirrored distribution, in a first step 102 of
procedure 100, processor 26 determines primary device Bn1
for locating a stripe using procedure 50 or procedure 70.
In a second step 104, processor 26 determines secondary
device Bn2 for the stripe using procedure 50 or procedure
70, assuming that device Bn1 is not available. In a third
step 106, processor 26 allocates copies of the stripe to
devices Bn1 and Bn2, and writes the device identities to
a table 34 in memory 28, for future reference. Processor
26 implements procedure 100 for all stripes 36 in devices
Bn.
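By way of illustration only, steps 102 to 106 may be sketched as follows for the case in which both steps use the random draw of procedure 50. The function name mirrored_allocation and the dictionary standing in for table 34 are assumptions made for the example.

```python
import random

def mirrored_allocation(num_stripes, devices, seed=None):
    """Sketch of procedure 100: a primary and a distinct secondary device per stripe."""
    rng = random.Random(seed)       # seeded only so that the sketch is repeatable
    table = {}                      # stands in for table 34 (hypothetical layout)
    for s in range(1, num_stripes + 1):
        primary = rng.choice(devices)                   # step 102
        others = [d for d in devices if d != primary]   # primary treated as unavailable
        secondary = rng.choice(others)                  # step 104
        table[s] = (primary, secondary)                 # step 106
    return table

table_34 = mirrored_allocation(num_stripes=1000, devices=[1, 2, 3, 4, 5], seed=4)
print(table_34[1])  # (primary, secondary) device identities for stripe 1
```

Drawing the secondary device from the set that excludes the primary is what guarantees n1≠n2 in this sketch.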

Table IV below illustrates devices Bn1 and Bn2
determined for stripes 6058 - 6078 of Table I, where
steps 102 and 104 use procedure 50.

Table IV



If any specific device Bn becomes unavailable, so
that only one copy of the stripes on the device is
available in system 12, processor 26 may implement a
procedure similar to procedure 100 to generate a new
second copy of the stripes that were on the unavailable
device. For example, if after allocating stripes 6058 -
6078 according to Table IV, device B3 becomes
unavailable, copies of stripes 6062, 6065, 6067, and
6075 need to be allocated to new devices in system 12 to
maintain full mirroring. Procedure 100 may be modified to
find the new device of each stripe by assuming that the
remaining device, as well as device B3, is unavailable.
Thus, for stripe 6062, processor 26 assumes that devices
B1 and B3 are unavailable, and determines that instead of
device B3 the stripe should be written to device B4.
Table V below shows the devices that the modified
procedure 100 determines for stripes 6058, 6060, 6062,
6065, 6072, and 6078, when B3 becomes unavailable.

Table V



It will be appreciated that procedure 100 spreads
locations for stripes 36 substantially evenly across all
devices Bn, while ensuring that the two copies of
any particular stripe are on different devices, as is
illustrated in Fig. 7. Furthermore, the even distribution
of locations is maintained even when one of devices Bn
becomes unavailable. Either copy, or both copies, of any
particular stripe may be used when host 24 communicates
with system 12. It will also be appreciated that in the
event of one of devices Bn becoming unavailable,
procedure 100 regenerates secondary locations for copies
of stripes 36 that are evenly distributed over devices
Bn.

Referring back to Fig. 1, it will be understood that
the sizes of tables 30, 32, or 34 are a function of the
number of stripes in system 12, as well as the number of
storage devices in the system. Some embodiments of the
present invention reduce the sizes of tables 30, 32, or
34 by duplicating some of the entries of the tables, by
relating different stripes mathematically. For example,
if system 12 comprises 2,000,000 stripes, the same
distribution may apply to every 500,000 stripes, as
illustrated in Table VI below. Table VI is derived from
Table I.

Table VI



It will be appreciated that procedures such as those
described above may be applied substantially
independently to different storage devices, or types of
devices, of a storage system. For example, a storage
system may comprise a distributed fast access cache
coupled to a distributed slow access mass storage. Such a
storage system is described in more detail in the U.S.
Application titled "Distributed Independent Cache
Memory," filed on 15 July, 2003, and assigned to the
assignee of the present invention. The fast access cache
may be assigned addresses according to procedure 50 or
modifications of procedure 50, while the slow access mass
storage may be assigned addresses according to procedure
70 or modifications of procedure 70.

Fig. 9 is a schematic diagram of a storage system
118, and Fig. 10 is a schematic diagram illustrating
distribution of data D to stripes 36 in one or more
storage devices Bn of system 118, according to an
embodiment of the present invention. Apart from the
differences described below, the operation of system 118
is generally similar to that of system 12 (Fig. 1), such
that elements indicated by the same reference numerals in
both systems 12 and 118 are generally identical in
construction and in operation. In the example described
with respect to Figs. 9 and 10, except where otherwise
stated, data D is assumed to be one set 120 of data,
typically comprising a single file. Data D is delineated,
typically by processor 26, into a number of sequential
partitions 122, each partition 122 comprising an equal
number of bytes. Specific partitions 122 are also
referred to herein as P1, P2, ..., and generally as
partitions P. By way of example, data D is assumed to
comprise 10 Mbytes, which are delineated into 1000
partitions P1, P2, ..., P1000, each partition comprising
10 Kbytes.

Processor 26 allocates partitions P to stripes 36 so
that balanced access to the stripes is maintained.
Hereinbelow, by way of example there are assumed to be
100 stripes 36, referred to herein as stripes S1, S2, ...,
S100, and generally as stripes S, to which partitions P
are allocated. Methods by which processor 26 may
implement the allocation are described hereinbelow.

In one method of allocation of partitions P, the 
partitions are allocated to stripes S according to the 
following equations: 

$$
\begin{aligned}
P_n &\in S_{(n \bmod 100)}, && n \bmod 100 \neq 0;\\
P_n &\in S_{100}, && n \bmod 100 = 0;\\
n &\in \{1, 2, \ldots, 1000\}
\end{aligned}
\tag{2}
$$



As is illustrated in Fig. 10, when data D is 10
Mbytes, equations (2) distribute partitions P
substantially evenly over stripes S.

Equations (2) are a specific case of a generalized
method for distributing a number p of partitions P over a
number s of stripes S. Equations (3) are the
corresponding generalization of equations (2):

$$
\begin{aligned}
P_n &\in S_{(n \bmod s)}, && n \bmod s \neq 0;\\
P_n &\in S_{s}, && n \bmod s = 0;\\
n &\in \{1, 2, \ldots, p\}
\end{aligned}
\tag{3}
$$



Applying equations (3) to data D will implement a
substantially even distribution for any data D, as long
as p >> s. It will be appreciated that if data D
comprises more than one set of data, applying equations
(3) to each of the sets will distribute the data of all
the sets approximately evenly over stripes S, as long as
p >> s for every set.
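By way of illustration only, equations (3) may be sketched as follows, using the values p = 1000 and s = 100 of the example of Fig. 10. The function name stripe_for_partition is an assumption made for the example.

```python
def stripe_for_partition(n, s):
    """Equations (3): partition Pn goes to stripe S(n mod s), or to Ss when n mod s is 0."""
    r = n % s
    return r if r != 0 else s

# Count how many of the partitions P1..P1000 land on each of the stripes S1..S100.
counts = {}
for n in range(1, 1001):
    stripe = stripe_for_partition(n, 100)
    counts[stripe] = counts.get(stripe, 0) + 1
print(min(counts.values()), max(counts.values()))
```

Because p is here an exact multiple of s, every stripe receives the same number of partitions; when p is not a multiple of s the counts differ by at most one, which is negligible when p >> s.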

Fig. 11 is a schematic diagram illustrating an
alternative method of distributing data D to stripes
36 in one or more storage devices Bn of system 118,
according to an embodiment of the present invention. In
the example described with respect to Fig. 11, data D is
assumed to comprise a multiplicity of data-sets Ff of
data, f ∈ {1, 2, ..., m}, each data-set Ff typically
comprising one file, although it will be understood that
a data-set may comprise substantially any group of data.
Processor 26 delineates each data-set Ff into a number of
partitions 132, each partition 132 comprising an equal
number of bytes. A general expression used herein for a
partition of data-set Ff is Pn(Ff), where n is a whole
number having a maximum value p. The value of p typically
varies from data-set to data-set, and depends on the
number of bytes in Ff and the size of the partitions into
which data-sets Ff are delineated. Specific partitions
132 are P1(F1), P2(F1), ..., P1(F2), P2(F2), ..., Pn(Ff),
..., P1(Fm), P2(Fm), ..., Pp(Fm). Partitions 132 are also
referred to generally herein as partitions P.

In order to distribute partitions P between stripes
S, processor 26 generates a random non-negative integral
offset H(Ff) for each data-set Ff. The processor may
generate H(Ff) by any randomizing process known in the
art, such as a hashing function, and sets the value of
H(Ff) to be any integer between 0 and (s - 1), where s is
the number of stripes S. Processor 26 applies the
respective offset H(Ff) to each data-set Ff, and
allocates each of the partitions of each data-set Ff
according to the following equations.



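$$
\begin{aligned}
P_n(F_f) &\in S_{[(H(F_f) + n) \bmod s]}, && (H(F_f) + n) \bmod s \neq 0;\\
P_n(F_f) &\in S_{s}, && (H(F_f) + n) \bmod s = 0;\\
n &\in \{1, 2, \ldots, p\}, && f \in \{1, 2, \ldots, m\},\quad H(F_f) \in \{0, 1, \ldots, s - 1\}
\end{aligned}
\tag{4}
$$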

To illustrate implementation of equations (4), by
way of example m is assumed equal to five, so that data D
comprises data-sets F1, F2, F3, F4, and F5. The data-sets
are assumed to be delineated into partitions of size 10
Kb. The sizes of data-sets F1, F2, F3, F4, and F5 are
respectively 1.32 Mb, 2.03 Mb, 1.01 Mb, 780 Kb, and 15
Kb, so that the value of p for each of the data-sets is
132, 203, 101, 78, and 2. The number of stripes, s, into
which the partitions are allocated is assumed to be 100.
Processor 26 is assumed to generate the following
offsets: H(F1) = 70, H(F2) = 99, H(F3) = 0, H(F4) = 25,
and H(F5) = 40.

Applying equations (4) to determine to which stripe
partitions are allocated gives:

For data-set F1: P1(F1) ∈ S71; ...; P30(F1) ∈ S100;
P31(F1) ∈ S1; P32(F1) ∈ S2; ...; P130(F1) ∈ S100;
P131(F1) ∈ S1; P132(F1) ∈ S2.

For data-set F2: P1(F2) ∈ S100; P2(F2) ∈ S1; P3(F2)
∈ S2; P4(F2) ∈ S3; ...; P201(F2) ∈ S100; P202(F2) ∈ S1;
P203(F2) ∈ S2.

For data-set F3: P1(F3) ∈ S1; P2(F3) ∈ S2; P3(F3) ∈
S3; ...; P100(F3) ∈ S100; P101(F3) ∈ S1.

For data-set F4: P1(F4) ∈ S26; ...; P75(F4) ∈ S100;
P76(F4) ∈ S1; P77(F4) ∈ S2; P78(F4) ∈ S3.

For data-set F5: P1(F5) ∈ S41; P2(F5) ∈ S42.
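By way of illustration only, the worked example above may be reproduced by the following sketch of equations (4). The function name stripe_for_partition and the dictionaries holding the offsets and partition counts are assumptions made for the example; the numerical values are those given in the text.

```python
def stripe_for_partition(n, offset, s):
    """Equations (4): Pn(Ff) goes to stripe S[(H(Ff) + n) mod s], or to Ss when that is 0."""
    r = (offset + n) % s
    return r if r != 0 else s

offsets = {"F1": 70, "F2": 99, "F3": 0, "F4": 25, "F5": 40}            # H(Ff)
partition_counts = {"F1": 132, "F2": 203, "F3": 101, "F4": 78, "F5": 2}  # p per data-set

# Map every (data-set, partition number) pair to its stripe number.
allocation = {
    (f, n): stripe_for_partition(n, offsets[f], 100)
    for f, p in partition_counts.items()
    for n in range(1, p + 1)
}
print(allocation[("F1", 1)], allocation[("F2", 1)], allocation[("F5", 2)])
```

The differing offsets mean that small data-sets such as F5 do not all start at stripe S1, which is how this method keeps the load even when p is not much larger than s.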
It will be appreciated that in general equations (4)
distribute partitions P substantially evenly over stripes
S, the distribution being independent of the size of the
partitions and of the relation of the number of
partitions to the number of stripes. It will also be
appreciated that while in the examples above stripes S
are sequential, the allocation of the stripes to physical
devices Bn typically spreads the individual stripes over
devices Bn.

Equations (2) or (3) may be implemented by storing
one or more procedures 35 (Fig. 9), corresponding to the
equations, in memory 28. Equations (4) may be implemented
by storing one or more procedures 39 corresponding to the
equations in memory 28, together with a table 41 of
random integral offsets H(Ff) for each data-set Ff.
Alternatively, tables corresponding to the results of
procedures 35 and/or 39 may be stored in memory 28.
Processor 26 uses the procedures and/or tables when
accessing the data, typically for storage and/or
retrieval of data, in order to determine the stripe
corresponding to a required partition.

Equations (2), (3), and (4) are examples of methods
for distributing partitions of data-sets among stripes,
using a combination of a random number and a sequential
partition number to determine to which stripe a specific
partition is allocated, and performing the allocation so
that the partitions are evenly distributed among the
stripes. The random number is chosen from a set of
different numbers, the cardinality of the set being
assigned to be equal to the number of stripes. All such
methods for distributing partitions evenly among stripes,
using a sequential partition number and numbers chosen
randomly from a set of different numbers, the set having
a cardinality equal to the number of stripes, are assumed
to be comprised within the scope of the present
invention.

Fig. 12 is a flowchart 140 showing steps performed
when data D, stored in devices Bn of system 118, is
redistributed if a device is added to the system, or if a
device is removed from the system, according to an
embodiment of the present invention.

In a first step 142, processor 26 allocates stripes 
S of devices B n according to one of the methods described 
above with respect to Fig. 2, Fig. 3, or Fig. 8. 

In a second step 144, the processor delineates data
D into equal size partitions. The processor then
allocates the partitions to stripes S according to
equations (3) or (4), using procedures 35, 39 and/or
tables as described above.

In a third step 146, the processor stores the
partitions to devices Bn according to the stripes
determined in the second step.

If a device is added to system 118, in a fourth step
148, processor 26 reallocates the stripes of existing
devices to the added device, as described above with
respect to Fig. 5. In a fifth step 150, partitions
corresponding to the reallocated stripes are stored to
the added device.

If a device is removed from system 118, in a sixth
step 152 processor 26 reallocates the stripes of the
removed device to the remaining devices, as described
above with respect to Fig. 4. In a seventh step 154,
partitions corresponding to the reallocated stripes are
stored to the remaining devices, in accordance with the
reallocated stripes.

After step 150 or 154, the flowchart ends.

The first three steps of flowchart 140 (steps 142,
144, and 146) use two distribution processes to ensure
even distribution of data over devices Bn. Step 142
distributes the stripes substantially evenly and randomly
over the devices, and step 144 distributes the partitions
substantially evenly and randomly over the stripes. The
process used in step 142 is then typically used if, in
steps 148 or 152, a device is added or removed, the
process ensuring that the least amount of data transfer
occurs because of the addition or removal.
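By way of illustration only, the two distribution processes of steps 142 to 146 may be composed as in the following sketch, which uses the random draw of procedure 50 for step 142 and equations (3) for step 144. The function name partition_to_device and the dictionary standing in for the stripe allocation table are assumptions made for the example.

```python
import random

def partition_to_device(n, num_stripes, stripe_table):
    """Steps 144-146: find the stripe of partition Pn, then the device holding it."""
    r = n % num_stripes
    stripe = r if r != 0 else num_stripes    # step 144, equations (3)
    return stripe, stripe_table[stripe]      # step 146: device of that stripe

rng = random.Random(3)                       # seeded only so the sketch is repeatable
stripe_table = {s: rng.randint(1, 5) for s in range(1, 101)}   # step 142, procedure 50
placement = {n: partition_to_device(n, 100, stripe_table) for n in range(1, 1001)}
print(placement[1], placement[100])          # (stripe, device) pairs
```

Because a partition reaches a device only through its stripe, a later change to stripe_table on addition or removal of a device moves only the partitions of the reallocated stripes.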

Some embodiments of the present invention store data
D using one randomizing process. An example of such a
process is described with respect to Fig. 13 below.

Fig. 13 is a flowchart 160 showing steps performed
when data D, stored in devices Bn of system 118, is
redistributed if a device is added to the system, or if a
device is removed from the system, according to an
alternative embodiment of the present invention. Data D
may be in the form of one or more data-sets, as
exemplified by Figs. 10 and 11.

In a first step 162, processor 26 allocates stripes
S of devices Bn according to any convenient manner,
typically a non-random manner. For example, if five
devices Bn comprise 100 stripes, device B1 is allocated
stripes 1 to 20, device B2 is allocated stripes 21 to 40,
..., and device B5 is allocated stripes 81 to 100.

In a second step 164, processor 26 delineates data D
into equal size partitions. The processor then allocates
the partitions to stripes S according to one of the
randomizing or consistent hashing procedures described
above with respect to Fig. 2, Fig. 3, or Fig. 8. The
allocation typically generates an allocation table,
similar to Table I, having a first column as the
partition number, and last columns as the stripe number
and corresponding device number. The allocation table
thus gives a relationship between each partition number
and its stripe number, and is stored as a look-up table
43 in memory 28, for use by processor 26 in accessing the
partitions. Table VII below illustrates generation of
table 43. Alternatively or additionally, a procedure 45
using a consistent hashing function, similar to the
consistent hashing functions described above, is stored
in memory 28, for use in generating the relationship.

In a third step 166, processor 26 stores the 
partitions to stripes, according to the relationship of 
step 164. 

If a device is added to system 118, in a fourth step
168, processor 26 reallocates partitions stored in
existing devices to stripes of the added device. The
reallocation is performed in a generally similar manner,
mutatis mutandis, to the method described above with
respect to Fig. 5. In a fifth step 170, reallocated
partitions are stored to the stripes of the added device.

If a device is removed from system 118, in a sixth
step 172 processor 26 reallocates partitions stored in
the removed device to stripes of the remaining devices.
The reallocation is performed in a generally similar
manner, mutatis mutandis, to the method described above
with respect to Fig. 4. In a seventh step 174,
reallocated partitions are stored to the stripes of the
remaining devices, in accordance with the reallocation
determined in step 172.
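
By way of illustration only, the reallocations of steps 168 and 172 may be sketched as follows. The methods of Figs. 4 and 5 are not reproduced here, so the selection rules below are assumptions of this sketch: on addition, each partition moves to the added device with a probability equal to that device's share of all stripes, and on removal, each partition of the removed device is sent to a uniformly chosen stripe of the remaining devices. The function names are likewise hypothetical.

```python
import random


def reallocate_on_add(table, stripe_to_device, new_stripes, new_device, rng):
    """Step 168 sketch: move some partitions to stripes of the added device.

    Partitions that are not selected keep their existing stripe, so data
    moves only from existing devices to the added device.
    """
    devices = dict(stripe_to_device)
    for stripe in new_stripes:
        devices[stripe] = new_device
    # Assumed rule: move with probability equal to the new device's stripe share.
    move_probability = len(new_stripes) / len(devices)
    new_table = {}
    for partition, stripe in table.items():
        if rng.random() < move_probability:
            new_table[partition] = rng.choice(new_stripes)
        else:
            new_table[partition] = stripe
    return new_table, devices


def reallocate_on_removal(table, stripe_to_device, removed_device, rng):
    """Step 172 sketch: move partitions of the removed device to remaining stripes."""
    remaining = [s for s, d in stripe_to_device.items() if d != removed_device]
    if not remaining:
        raise ValueError("no devices remain")
    new_table = {}
    for partition, stripe in table.items():
        if stripe_to_device[stripe] == removed_device:
            new_table[partition] = rng.choice(remaining)  # assumed uniform choice
        else:
            new_table[partition] = stripe
    return new_table


rng = random.Random(0)
stripe_to_device = {s: (s - 1) // 20 + 1 for s in range(1, 101)}
table = {p: rng.randint(1, 100) for p in range(1, 1001)}
after_removal = reallocate_on_removal(table, stripe_to_device, 5, rng)
after_add, devices = reallocate_on_add(
    table, stripe_to_device, list(range(101, 121)), 6, rng
)
```

In both cases, partitions that are not affected by the change remain on their original stripes, so that only the minimum necessary data is transferred.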

After step 170 or 174, flowchart 160 ends.

Table VII below illustrates generation of table 43
for data D corresponding to one set 120 of data (Fig.
10). Table VII assumes that partitions P are stored to
100 stripes 36, referred to herein as stripes S1, S2, ...,
S100, and the stripes have been evenly pre-allocated to
five devices B1, ..., B5. A random number between 1 and 100
is used to allocate a partition to a stripe.

Table VII illustrates a relationship between
partitions and stripes for a single set of data, using a
random number generator. Those skilled in the art will be
able to adapt the procedures described herein for
generating table VII using a consistent hashing function,
and/or in the case of data D comprising more than one
data-set.
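
By way of illustration only, generation of a look-up table in the manner just described may be sketched as follows. The function name build_lookup_table, the seed value, and the number of partitions shown are assumptions of this sketch, and the rows it prints are arbitrary and are not the entries of Table VII.

```python
import random


def build_lookup_table(num_partitions, stripe_to_device, rng):
    """Allocate each partition to a stripe using a random number generator.

    Returns {partition number: (stripe number, device number)}. Stripes are
    assumed to be numbered 1..N, where N = len(stripe_to_device).
    """
    num_stripes = len(stripe_to_device)
    table = {}
    for partition in range(1, num_partitions + 1):
        stripe = rng.randint(1, num_stripes)  # random number between 1 and N
        table[partition] = (stripe, stripe_to_device[stripe])
    return table


# 100 stripes evenly pre-allocated to five devices, as assumed for Table VII.
stripe_to_device = {s: (s - 1) // 20 + 1 for s in range(1, 101)}
table_43 = build_lookup_table(10, stripe_to_device, random.Random(2003))
for partition, (stripe, device) in table_43.items():
    print(f"P{partition} -> S{stripe} on B{device}")
```

Since the device number follows from the stripe number, only the partition-to-stripe relationship needs to be generated randomly.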

Fig. 14 is a schematic block diagram of an
alternative storage system 210, according to an
embodiment of the present invention. System 210 acts as a
data memory for one or more host processors 252, which
are coupled to the storage system by any means known in
the art, for example, via a network such as the Internet
or by a bus. Herein, by way of example, hosts 252 and
system 210 are assumed to be coupled by a network 250.
The data stored within system 210 is stored at stripes
251 in one or more slow access time mass storage devices,
hereinbelow assumed to be one or more disks 212, by way
of example. The data is typically stored and accessed as
partitions of data-sets. A system manager 254 acts as a
control unit for the system. It will be appreciated that
manager 254 may comprise one or more processing units,
and that some or all of the processing units may be
centralized or distributed in substantially any suitable
locations, such as within elements of system 210 and/or
hosts 252.

System 210 comprises one or more substantially
similar interfaces 226 which receive input/output (IO)
access requests for data in disks 212 from hosts 252.
Each interface 226 may be implemented in hardware and/or
software, and may be located in storage system 210 or
alternatively in any other suitable location, such as an
element of network 250 or one of host processors 252.
Between disks 212 and the interfaces are a plurality of
interim devices, also termed herein interim caches 220,
each cache 220 comprising memory having fast access time,
and each cache being at an equal level hierarchically.
Each cache 220 typically comprises random access memory
(RAM), such as dynamic RAM, and may also comprise
software. Caches 220 are coupled to interfaces 226 by any
suitable fast coupling system known in the art, such as a
bus or a switch, so that each interface is able to
communicate with, and transfer data to and from, any
cache. Herein the coupling between caches 220 and
interfaces 226 is assumed, by way of example, to be by a
first cross-point switch 214. Interfaces 226 operate
substantially independently of each other. Caches 220 and
interfaces 226 operate as a data-set transfer system 227,
transferring data-sets and/or partitions of data-sets
between hosts 252 and disks 212.

Caches 220 are typically coupled to disks 212 by a
fast coupling system. The coupling between the caches and
the disks may be by a "second plurality of caches to
first plurality of disks" coupling, herein termed an
"all-to-all" coupling, such as a second cross-point
switch 224. Alternatively, one or more subsets of the
caches may be coupled to one or more subsets of the
disks. Further alternatively, the coupling may be by a
"one-cache-to-one-disk" coupling, herein termed a
"one-to-one" coupling, so that one cache communicates with
one disk. The coupling may also be configured as a
combination of any of these types of coupling. Disks 212
operate substantially independently of each other.

At setup of system 210, system manager 254 assigns a
range of stripes to each cache 220. Manager 254 may
subsequently reassign the ranges during operation of the
system, and an example of steps to be taken in the event
of a cache change is described in application number
10/620249. The ranges are chosen so that the complete
memory address space of disks 212 is covered, and so that
each stripe is mapped to at least one cache; typically
more than one is used for redundancy purposes. The
assigned ranges for each cache 220 are typically stored
in each interface 226 as a substantially similar table,
and the table is used by the interfaces in routing IO
requests from hosts 252 to the caches. Alternatively or
additionally, the assigned ranges for each cache 220 are
stored in each interface 226 as a substantially similar
function, such as the function exemplified by equations
(1) above. Further alternatively, any other suitable
method known in the art for generating a correspondence
between ranges and caches may be incorporated into
interfaces 226. Hereinbelow, the correspondence between
caches and ranges is referred to as stripe-cache mapping
228, and it will be understood that mapping 228 gives
each interface 226 a general overview of the complete
cache address space of system 210.
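
By way of illustration only, a table form of stripe-cache mapping 228 may be sketched as follows. The class name StripeCacheMapping, the cache identifiers, and the particular ranges are hypothetical and are chosen only to show each stripe mapped to more than one cache; the function form exemplified by equations (1) is not reproduced here.

```python
from bisect import bisect_right


class StripeCacheMapping:
    """Table of stripe ranges, each assigned to one or more caches."""

    def __init__(self, ranges):
        # ranges: iterable of (first_stripe, last_stripe, [cache ids]),
        # assumed non-overlapping and together covering every stripe.
        self._ranges = sorted(ranges, key=lambda r: r[0])
        self._starts = [first for first, _, _ in self._ranges]

    def caches_for(self, stripe):
        """Return the caches assigned to the range containing the stripe."""
        index = bisect_right(self._starts, stripe) - 1
        if index < 0:
            raise KeyError(f"stripe {stripe} is not mapped")
        first, last, caches = self._ranges[index]
        if stripe > last:
            raise KeyError(f"stripe {stripe} is not mapped")
        return caches


# Hypothetical assignment: two caches per range, for redundancy.
mapping = StripeCacheMapping([
    (1, 50, ["cache-A", "cache-B"]),
    (51, 100, ["cache-B", "cache-C"]),
])
```

An interface holding such a table can route a request for any stripe to a corresponding cache without consulting the other interfaces.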

In system 210, each cache 220 contains a partition
location table 221 specific to the cache. Each partition
location table 221 gives its respective cache exact
location details, on disks 212, for partitions of the
range of stripes assigned to the cache. Partition
location table 221 may be implemented as software,
hardware, or a combination of software and hardware. The
operations of a table similar to partition location table
221, and also of a mapping similar to mapping 228, are
explained in more detail in application 10/620249.

Fig. 15 is a flow chart showing steps followed by
system 210 on receipt of an IO request from one of hosts
252, according to an embodiment of the present invention.
Each IO request from a specific host 252 comprises
several parameters, such as whether the request is a read
or a write command, and which partitions and/or data-sets
are included in the request.

In an initial step 300, the IO request is
transmitted to system 210 according to a protocol under
which the hosts and the system are operating. The request
is received by system 210 at one of interfaces 226,
herein, for clarity, termed the request-receiving
interface (RRI).

In a stripe identification step 302, the RRI
interface identifies from the request which partitions
and/or data-sets are to be read, or which partitions
and/or data-sets are to be written to. The RRI interface
then determines the stripes corresponding to the
identified partitions and/or data-sets.

In a cache identification step 304, the RRI
interface refers to its mapping 228 to determine the
caches corresponding to the stripes determined in step
302. For each stripe so determined, the RRI interface
transfers a respective partition and/or data-set request
to the corresponding cache. It will be understood that
each partition and/or data-set request is a read or a
write command, according to the originating IO request.

In a cache response step 306, each cache 220
receiving a partition and/or data-set request from the
RRI interface responds to the request. The response is a
function of, inter alia, the type of request, i.e.,
whether the request is a read or a write command and
whether the request is a "hit" or a "miss." Thus, a
partition and/or data-set may be written to one or more
disks 212 from the cache and/or read from one or more
disks 212 to the cache. A partition and/or data-set may
also be written to the RRI from the cache and/or read
from the RRI to the cache. If the response includes
writing to or reading from a disk 212, the cache uses its
partition location table 221 to determine the location on
the corresponding disk of the partition and/or data-set.
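
By way of illustration only, steps 302, 304, and 306 may be sketched as follows. The class names, the dictionaries standing in for disks 212, and the partition and stripe numbers are hypothetical. The sketch further assumes that each stripe is routed to a single cache and that writes are passed through to disk immediately; neither assumption is required by the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cache:
    """Stand-in for a cache 220 holding its own partition location table 221."""
    location_table: dict   # partition -> (disk id, address on that disk)
    disks: dict            # disk id -> {address: data}, standing in for disks 212
    memory: dict = field(default_factory=dict)  # partitions currently cached

    def handle(self, command: str, partition: int, data: Optional[bytes] = None):
        disk_id, address = self.location_table[partition]
        if command == "read":
            if partition in self.memory:              # hit: serve from cache
                return self.memory[partition], "hit"
            value = self.disks[disk_id][address]      # miss: read disk to cache
            self.memory[partition] = value
            return value, "miss"
        if command == "write":
            self.memory[partition] = data
            self.disks[disk_id][address] = data       # assumed write-through
            return data, "written"
        raise ValueError(f"unknown command: {command}")


@dataclass
class Interface:
    """Stand-in for the request-receiving interface."""
    partition_to_stripe: dict   # used in step 302
    stripe_to_cache: dict       # simplified mapping 228, used in step 304

    def handle_request(self, command, partitions, data=None):
        results = {}
        for partition in partitions:
            stripe = self.partition_to_stripe[partition]      # step 302
            cache = self.stripe_to_cache[stripe]              # step 304
            payload = None if data is None else data[partition]
            results[partition] = cache.handle(command, partition, payload)  # step 306
        return results


disks = {"disk-1": {0: b"alpha", 1: b"beta"}}
cache = Cache(location_table={7: ("disk-1", 0), 8: ("disk-1", 1)}, disks=disks)
rri = Interface(partition_to_stripe={7: 3, 8: 3}, stripe_to_cache={3: cache})

first_read = rri.handle_request("read", [7])
second_read = rri.handle_request("read", [7])
write_result = rri.handle_request("write", [8], {8: b"gamma"})
```

In this sketch the first read of partition 7 is a miss served from the disk, the second read is a hit served from the cache, and the write to partition 8 reaches the disk at the location given by the partition location table.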

As stated in the Background of the Invention, there
are a number of different types of data storage systems
known in the art, the systems differing, inter alia, in
the basic unit of storage that is used. For example, SAN
systems use logical units (LUs), and NAS systems use
files. It will be appreciated that embodiments of the
present invention may be used substantially regardless of
the type of storage system that is implemented. For
example, referring back to Fig. 11, sets of data F1, F2,
F3, ... may comprise sets of files, or sets of file
meta-data, so that system 118 may operate within a NAS
system. Alternatively, sets of data F1, F2, F3, ... may
comprise sets of storage objects, so that system 118 may
operate within an OSA system or within a CAS system.
Furthermore, sets of data F1, F2, F3, ... may comprise other
classifications of data known in the art, such as data
comprising a data packet, a video tape, a music track, an
image, a database record, contents of a logical unit,
and/or an email.

It will be appreciated that the embodiments
described above are cited by way of example, and that the
present invention is not limited to what has been
particularly shown and described hereinabove. Rather, the
scope of the present invention includes both combinations
and subcombinations of the various features described
hereinabove, as well as variations and modifications
thereof which would occur to persons skilled in the art
upon reading the foregoing description and which are not
disclosed in the prior art.